Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is code-along style; it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on QUERCUS and also distributed via email. The teaching materials consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include an HTML version of the notebook and datasets to import into Python when required. This learning approach lets you spend your time coding, not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post-lecture assessments will also be available (see the syllabus for the grading scheme and percentages of the final mark).
We'll take a blank-slate approach here to Python and assume that you know pretty much nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
A pile of data (like an Excel file or tab-separated file) full of experimental observations that you don't know what to do with.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
Welcome to this fifth lecture in a series of seven. Today we're going to branch off into the wonderful world of flow control and how you can really make your code work for you.
At the end of this lecture we will aim to have covered the following topics:
grey background - a package, function, code, command, or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
pandas provides the DataFrame class that allows us to format and play with data in a tabular format.
time provides various time-related functions.
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
Flow control structures are statements that allow us to repeat a task over and over until there are no more iterations to perform, or until a condition that we set is no longer met. This means that, with a few lines of code, you can perform tasks that would otherwise require copying and pasting your code hundreds or thousands of times.
Flow control is one of the most important skills to have in your computer programming toolbox. All programming languages use it, and while the logic is very similar across languages, the syntax differs. Under the hood of Python, its packages, methods, and functions all have some form of flow control implemented, especially in the cases where it seems like a single command is accomplishing a lot.
Having a good understanding of data subsetting and of logical, conditional, and comparison operators is critical to writing flow control programs. Thus, we will start off this lecture with a recap of some of those concepts from previous lectures.
| Comparing the use of flow control for a program or scripts vs a linear sequence of code. Image from https://codewithlogic.wordpress.com/2013/09/01/python-basics-understanding-the-flow-control-statements/ |
As we saw last week, for instance, we were able to take advantage of the groupby() method to organize our DataFrame data based on categories. We may wish, for instance, to look at these groups individually to decide if they merit further analysis or visualization. If your number of groups is rather small, you could manually curate this data. When dealing with much larger data sets, it would be in your best interests to automate this using the ideas of flow control.
In the above example, we would describe the process as iterating through your DataFrame. At each iteration you use a branching statement to determine whether a primary or secondary analysis should be performed. We'll learn throughout this lecture that there are a number of syntax patterns used to iterate or loop through your data, as well as a number of predefined forms of conditional or branching statements. Here are some helpful tables to summarize what we'll cover today:
| Statement | Description | Syntax |
|---|---|---|
| for loop | Used to iterate through a range, list, or other iterable data structure from start to end. | for item in iterable: statement |
| while loop | Used to iterate through a range, list, or other iterable data structure as long as a conditional expression remains true at the start of each iteration. | while condition: statement |
| if | Begins a branching statement to run specific code if a conditional is met. | if condition: statement |
| elif | Used to extend the if statement as an alternative condition/action pairing. | elif condition: statement |
| else | Used as a catch-all action to perform if no conditionals evaluate to true. | else: statement |
| break | Used to completely exit a looping statement. Usually used within a conditional. | if condition: break |
| continue | Used to end the current iteration of a looping structure but continue with the next. Usually used within a conditional. | if condition: continue |
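As a quick preview of how the statements in the table combine (a small sketch using a made-up list of numbers; each construct is covered in detail later in the lecture):

```python
numbers = [3, 7, 10, 15, 22]

for n in numbers:
    if n == 10:
        continue        # skip this iteration and move to the next one
    elif n > 20:
        break           # exit the loop completely
    else:
        print(n)        # runs when neither condition above was met
```

Only 3, 7, and 15 are printed: 10 is skipped by continue, and 22 triggers break before it can be printed.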
One key aspect of these types of operators is that their output is boolean (True or False), and those outputs can be used to perform a wide range of operations. Here are some comparison operators.
We've already seen the comparison operators in previous lectures. They're used to generate logical expressions that we use to filter values or set conditions for further steps. We'll even use these to determine the branching of code (i.e. control of flow). Here's a table briefly summarizing these operators:
| Operator | Description |
|---|---|
| > | Greater than |
| >= | Greater than or equal to |
| < | Less than |
| <= | Less than or equal to |
| == | Equivalent values (but not necessarily equivalent objects in memory) |
| != | Inequality or dissimilar values |
These are quite straightforward to work with for integers or floats.
# Less than
5 < 3
# Equivalent
10 == 9
# Dissimilar
24 != 24
# Less than or equal to
8 <= 15
False
False
False
True
The rules for using comparison operators on strings are slightly different from those for integers. When comparing strings, Python compares characters position by position using their Unicode code points: the first position where the strings differ decides the result, and if one string is a prefix of the other, the shorter string is considered smaller.
Here's a Unicode table to help us out with our interpretation.
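The built-in ord() function returns the Unicode code point Python uses in these comparisons, so we can check the table's values directly:

```python
# ord() returns the Unicode code point of a single character
print(ord('B'), ord('C'), ord('c'))   # 66 67 99

# Uppercase letters have smaller code points than lowercase ones,
# which is why 'Carrot' sorts before 'carrot'
print('Carrot' < 'carrot')            # True, since ord('C') < ord('c')
```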
Let's give it a try shall we?
# Can a string be less than itself?
"carrot" < "carrot"
# What about Capitalization?
"Carrot" < "carrot"
False
True
# character position matters too!
"cArrot" < "Carrot"
# Does overall Unicode "value" matter?
"Barrot" < "CARROT"
False
True
# What if we have a longer string?
"CARROTcake" < "CARROT"
# What if we have a longer string with unequal values?
"BARROTcake" < "CARROT"
# What about just adding a space?
"CARROT " == "CARROT"
False
True
False
We've seen that we can compare integers with integers and how strings can be compared but we can't simply compare dissimilar object types. So no comparing apples to sheep - they just don't stack up.
# Compare a string to an integer
"car" < 4
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-11-bab279a6e2d4> in <module> 1 # Compare a string to an integer ----> 2 "car" < 4 TypeError: '<' not supported between instances of 'str' and 'int'
# What about a string version of an integer?
"5" < 4
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-12-939b598c5dc5> in <module> 1 # What about a string version of an integer? ----> 2 "5" < 4 TypeError: '<' not supported between instances of 'str' and 'int'
# What if we cast a string to an int? or vice versa?
float("5") < 4
str(4) < "5"
int("car") < 4
False
True
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-13-5eaa76852175> in <module> 5 str(4) < "5" 6 ----> 7 int("car") < 4 ValueError: invalid literal for int() with base 10: 'car'
Bear in mind that logical comparison between list objects can be complicated. Python's built-in operators compare lists as whole objects, not element by element: comparison uses lexicographical order, comparing elements at each index beginning with index 0, and a list that is a prefix of a longer list is considered smaller. As elements are compared, they must also follow the previous rules we've outlined.
If you want to retrieve the results of a proper element-wise comparison you'll have to use something like the Numpy package.
var = [1, 4 , 6, 9]
var > 10
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-14-f172ce161b6b> in <module> 1 var = [1, 4 , 6, 9] 2 ----> 3 var > 10 TypeError: '>' not supported between instances of 'list' and 'int'
# Just a single element of var is smaller
var < [1, 4, 6, 10]
# It shouldn't be equivalent
var == [1, 4, 6, 10]
# These should be the same
var == [1, 4, 6, 9]
# A single element is larger
var < [1, 3, 6, 9]
True
False
True
False
# Some comparisons can be completed but only a single answer is returned
[1, 2, 3, 4] == [1, "two", 3, 4]
# But other operators will have trouble making the comparison between incomparables
[1, "2", 3, 4] < [1, "two", 3, 4]
# Will this work?
[1, 2, 3, 4] > [1, "two", 3, 4]
False
True
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-16-226f2280213e> in <module> 6 7 # Will this work? ----> 8 [1, 2, 3, 4] > [1, "two", 3, 4] TypeError: '>' not supported between instances of 'int' and 'str'
**Comparisons with a numpy array**

Recall that the numpy.array object has the ability to broadcast operations and perform mathematical expressions on like-sized arrays. The same applies to conditional expressions. By converting our list of numbers to an array object, we can perform conditional expressions against a scalar (single value) or against other arrays. The output, of course, is a boolean array of the original size.
# Needs to be a numpy array to compare to a single integer (array adds element-wise capability)
import numpy as np
var = np.array([1, 4, 6, 9])
var > 10
# and right side must be of length 1 or have the same shape
var < [3,4,6,3]
array([False, False, False, False])
array([ True, False, False, False])
The boolean operators are used for combining True and False values that can come in various formats. We've already come across some examples last lecture when we were filtering our data. Boolean operators can be used to combine logical expressions, variables, or both. We have four operators at our disposal to combine or compare boolean (logical) and non-boolean (bitwise) values.
| Operator | Description | Evaluation rules |
|---|---|---|
| and | Logical AND results in True only when all comparisons are True | True and True = True |
| True and False = False | ||
| False and False = False | ||
| & | Bitwise AND compares the binary values of an integer at every bit | 1010 1010 & 0101 0101 = 0000 0000 |
| or | Logical OR results in False only when all comparisons are False | True or True = True |
| True or False = True | ||
| False or False = False | ||
| | | Bitwise OR compares the binary values of an integer at every bit | 1010 1010 | 0101 0101 = 1111 1111 |
Recall that integers can be converted to booleans, with any value other than 0 being considered True. So we can also use bitwise comparison on these booleans, although and and or are more appropriate.
Note also that the comparison operators (<, >, ==, etc.) take higher precedence than and, or, and not when being evaluated within an expression. Conversely, the bitwise & and | take higher precedence than the comparison operators, so appropriate use of parentheses () will be required.
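A short sketch of why this precedence note matters: without parentheses, & grabs the neighbouring integers before the comparisons run.

```python
# & binds tighter than <, so this parses as 5 < (6 & 2) < 8,
# i.e. 5 < 2 < 8, which is False
print(5 < 6 & 2 < 8)

# With parentheses, the comparisons evaluate first: True & True
print((5 < 6) & (2 < 8))
```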
# Testing logical AND
True and True # True and True = True
(4 < 6 and 4 < 8) # True and True = True
(4 > 6 and 7 < 8) # False and True = False
(4 > 6 and 7 > 8) # False and False = False
(4 < 6 and 7 > 8) # True and False = False
True
True
False
False
False
# Testing logical OR
True or True # True or True = True
( 4 < 6 or 4 < 8) # True or True = True
( 4 > 6 or 7 < 8) # False or True = True
( 4 > 6 or 7 > 8) # False or False = False
( 4 < 6 or 7 > 8) # True or False = True
True
True
True
False
True
**Using not to negate your boolean values**

The final operator we'll review is the logical NOT. This is a unary operator that evaluates a single input and returns the opposite boolean value. It can be used to negate the boolean evaluation of a logical expression, which is especially useful when generating conditional statements that determine which parts of your code are run (i.e. control of flow).
# Simple case of using not
not False # True
# You can even use it on integers
not 1 # False
not 0 # True
True
False
True
# Reverse the result of a logical result
not 7 > 8 # True
not 7 < 8 # False: 7 IS less than 8
# A more complex case with mixed operands
not(7 < 8 and 5 < 4) # not(True AND False) = not(False) = True
not(7 < 8 and 4 < 5) # not(True AND True) = not(True) = False
True
False
True
False
**not can be used on non-boolean objects**

Beware: the logical not does not apply element-wise across container objects such as lists; it evaluates the object as a whole.
However, a quick and easy way to determine the status of an object is to use the logical not. It will return False unless the object is empty (or zero). This can also be a very useful way to determine if a variable has been assigned to a proper object.
# Try an empty string
not ''
# vs a character or string
not 'a'
# How does the logical NOT handle a list?
[True, False, True, False, False]
# NOT our list
not [False, False, False, False, False]
# What about any kind of list?
not [1, 2, 3]
# Perhaps an empty list?
not []
True
False
[True, False, True, False, False]
False
False
True
| There are a number of ways to obtain the same boolean result |
Here ends the recap on logical operators. Time to loop! But first...
# Use this code cell to answer the above comprehension questions.
**for loops**

for loops allow you to iteratively perform operations or data manipulations. Their general structure is:
for item in iterable:
statement
In the above general structure:
- iterable is a Python data structure that contains your data; it can be a list, an array, a data frame, etc.
- item is the iteration variable; it takes the form of each element that you are iterating over.
- statement is the set of instructions (addition, multiplication, evaluations, etc.) that you want to perform, usually over iterable.

In plain English, it means something like "for every item in iterable, do statement until you reach the last element in iterable".
The last thing to note is the indentation in this for loop. Up until now, we have not really been using any tabbed indentation style in our coding. Normally we use tabbed indentation to make our code more readable, e.g. by indenting the statements inside a for loop.
Python takes this philosophy a step further by requiring indentation-as-grammar: statements must be indented to be considered part of a control flow structure. We'll see what that means in upcoming examples.
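To see indentation-as-grammar in action, compare these two loops (a small sketch with a made-up list): the indentation level alone decides whether a statement runs inside the loop or after it.

```python
# Both print() calls are indented, so both belong to the loop body
for n in [1, 2]:
    print("inside the loop:", n)
    print("still inside the loop")

# This last print() is not indented, so it runs once, after the loop finishes
for n in [1, 2]:
    print("inside the loop:", n)
print("outside the loop")
```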
Here are a some definitions of concepts that we will be using today:
There are several structures that are iterable, including core Python data structures such as lists, tuples, dictionaries, and sets, and non-core structures such as multidimensional Numpy arrays and Pandas DataFrames. They are all iterable containers because you can retrieve iterators from them (https://www.w3schools.com/python/python_iterators.asp).
The job of iterators is to create a "count" or "index" of the elements over which you want to iterate, thus creating a road map to loop over an iterable (a data structure). There are several functions that are meant to be iterators or that can also work as iterators; which one to use varies with the program that you want to write and the iterable you are working with. Here are some of the most common iterators in Python:
- enumerate(iterable, start): Returns an enumerate object. iterable must be an object that supports iteration. The enumerate object yields pairs containing a count (from start, which defaults to zero) and a value yielded by the iterable argument. enumerate is useful for obtaining an indexed list.
- range(stop) or range(start, stop[, step]): Returns an object that produces a sequence of integers from start (inclusive) to stop (exclusive) by step (optional). range(i, j) produces i, i+1, i+2, ..., j-1. Much like slicing notation, start defaults to 0, and stop is excluded! range(4) produces 0, 1, 2, 3 - exactly the valid indices for a list of 4 elements. When step is given, it specifies the increment (or decrement).
- iter(iterable) or iter(callable, sentinel): Gets an iterator from an object. In the first form, the argument must supply its own iterator, or be a sequence. In the second form, the callable3 is called until it returns the sentinel4.
- numpy.ndenumerate(arr): Multidimensional index iterator. Returns an iterator yielding pairs of array coordinates and values.
- numpy.ndindex(*shape): An N-dimensional iterator object to index arrays. Given the shape of an array, an ndindex instance iterates over the N-dimensional index of the array. At each iteration a tuple of indices is returned; the last dimension is iterated over first.
- numpy.nditer(): Efficient multidimensional iterator object to iterate over arrays. To get started using this object, you can visit this helpful tutorial.

3 A callable is an object that allows you to use round parentheses ( ). 4 A sentinel value is a condition that indicates the termination of a recursive algorithm.
# enumerate?
# range?
# iter?
# import numpy as np
# np.ndenumerate?
# np.ndindex?
# np.nditer?
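As a quick preview of what a few of these built-in iterators yield (using a small made-up list, since we haven't loaded any data yet):

```python
types = ['C3', 'C4', 'CAM']

# enumerate() yields (index, value) pairs
for i, name in enumerate(types):
    print(i, name)                    # 0 C3, then 1 C4, then 2 CAM

# range() over the list's length yields the valid indices 0, 1, 2
print(list(range(len(types))))        # [0, 1, 2]

# iter() returns an iterator; next() pulls one element at a time
it = iter(types)
print(next(it))                       # C3
print(next(it))                       # C4
```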
That's a lot of functions to digest! Let's step back and break down iteration by looking at the different data structures we know.
The built-in list structure represents a mutable structure that you will likely work with often to iterate through. Let's iterate through some examples.
# For loop on a simple list
for number in [1,2,3]:
print("This is an integer:", number)
print("done!") # Note the indentation of this print() call?
This is an integer: 1 This is an integer: 2 This is an integer: 3 done!
# Generate a list assigned to a variable
list_1 =[1, 2, 3]
# Loop through a variable
for number in list_1:
print("This is an integer:", number)
print("done!")
This is an integer: 1 This is an integer: 2 This is an integer: 3 done!
Sometimes you might want to count the number of iterations that occur within a loop. This may be part of some branching code that we'll look at later. With a simple version of such code you can answer, for example, "How many odd integers are in my list?"
Let's see how a loop can be used to increment a variable's value.
count = 0 # we need to initialize the counter at 0
for item in list_1:
count = count + 1
print("item value:", item)
print("Final count value:", count)
item value: 1 item value: 2 item value: 3 Final count value: 3
Replace 1 with 30, and item with count in the print function, and run the code again. What do you think this code is doing?
count = 0 # We need to intialize the counter at 0
for item in list_1:
count = count + 30
print('Count:', count)
Count: 30 Count: 60 Count: 90
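Combining a counter with a conditional (previewing the if statement covered later in this lecture) answers the earlier question about odd integers. A minimal sketch with a made-up list:

```python
numbers = [1, 4, 6, 9, 11, 12]
odd_count = 0                 # initialize the counter at 0

for n in numbers:
    if n % 2 == 1:            # modulo: remainder after dividing by 2
        odd_count = odd_count + 1

print("Odd integers:", odd_count)   # 3 (the values 1, 9, and 11)
```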
**Cumulative sums with a for loop**

Cumulative summation can be done through for loops, although we also have the sum() function to accomplish that. Have you ever wondered how the sum() function actually works?
sum_list = [3, 41, 12, 9, 74, 15]
# Take the sum of this list
print("sum() function:", sum(sum_list), "\n")
total = 0
for item in sum_list:
# in this case we use item (iteration variable)
total = total + item
# Print the total with each loop
print("Updating total:", total)
print('\nFor loop total:', total)
sum() function: 154 Updating total: 3 Updating total: 44 Updating total: 56 Updating total: 65 Updating total: 139 Updating total: 154 For loop total: 154
**Subsetting using the iterable variable in your loop**

Now, let's try to subset using the iteration variable in our for loop. Remember that values will be assigned to an item variable from our iterable with each passing loop.
# Set up a list of strings
photosynthesis_types = ['C3', 'C4', 'CAM', 'Anoxygenic', 'Oxygenic']
# Use our list as the iterable in our for loop
for x in photosynthesis_types:
# Subset our list using values from our list
print(photosynthesis_types[x])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-31-3c35c1ec36a4> in <module> 5 for x in photosynthesis_types: 6 # Subset our list using values from our list ----> 7 print(photosynthesis_types[x]) TypeError: list indices must be integers or slices, not str
Python is not happy about it... It says indices need to be integers or slices, so let's try with integers to see how it behaves. Recall what we know about lists: can we subset or slice a list using string values?
# Make a list of integers
list_2 = [1, 2, 3, 4, 5]
# Use our integer list in our for loop
for x in list_2:
# Subset our integer list using the values from it
print(list_2[x])
2 3 4 5
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-32-5f2505ae03e5> in <module> 5 for x in list_2: 6 # Subset our integer list using the values from it ----> 7 print(list_2[x]) IndexError: list index out of range
Look at what happened above. We ended up accessing an element outside the range of the indices in our list. Even though we had a sequential set of integers, we started at 1, and lists are zero-indexed. Simply using the values from the list isn't the correct way to iterate through it either. What we really want is a list of integer values starting at 0 and going to the length of our list.
Here is where iterators play an important role. Before jumping into iterators, let's see what happens when we pass photosynthesis_type to the len() function.
# it will not work because len() returns an integer, not a range
for x in len(photosynthesis_types):
print(photosynthesis_types[x])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-33-e9f070694dc3> in <module> 1 # it will not work because len() returns an integer, not a range 2 ----> 3 for x in len(photosynthesis_types): 4 print(photosynthesis_types[x]) TypeError: 'int' object is not iterable
That didn't work either.
**The range() function as a for loop iterator**

Okay, we've looked at a lot of ways how not to make a for loop. Remember, the for loop by itself does not know what to do with a single integer (the output of len()). Instead, let's use the range() function. Recall that the default behaviour, when a single input is provided, is to produce [0, stop), where we use [ to denote inclusivity and ) to denote exclusivity.
The range() function, of course, returns an iterable. Let's give it a try!
# Supply our list to range()
for x in range(photosynthesis_types):
print(photosynthesis_types[x])
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-34-931a7f8fa6d9> in <module> 1 # Supply our list to range() ----> 2 for x in range(photosynthesis_types): 3 print(photosynthesis_types[x]) TypeError: 'list' object cannot be interpreted as an integer
The above TypeError means that range() has no idea what to do with a list; it expects integer input. What if we combine range() and len()?
# How to properly use range
range(len(photosynthesis_types))
# What object does it return?
type(range(len(photosynthesis_types)))
range(0, 5)
range
# Remember range takes in an integer value as input
for x in range(len(photosynthesis_types)):
print(photosynthesis_types[x])
C3 C4 CAM Anoxygenic Oxygenic
Now it works! The loop now has an index to iterate over ("from 0 to the last item").
So to summarize, we've used a for loop with the range(len(list)) combination to achieve an iterable set of indices.

What about other data structures?
Recall that Numpy arrays are not built-in data structures. While they share a lot of visual and conceptual similarities with Python list objects, they are not the same. Numpy instead includes functions for generating iterators from these objects. As a note, any package that produces iterable objects should include the basic methods, like __iter__, that Python expects to find when the object is provided to something like a for loop.
A numpy.array object returns an iterator that behaves very much like a list's, so each individual element is returned by the iterator. That being said, there can be differences in behaviour between objects, and these factors can influence how you should write your code.
import numpy as np
# 1d numpy array
array_1 = np.array([1,2,3,4,5])
array_1
array([1, 2, 3, 4, 5])
# Iterate through a for loop using the array_1 iterator
for i in array_1:
# Just print the entire array
print(array_1)
[1 2 3 4 5] [1 2 3 4 5] [1 2 3 4 5] [1 2 3 4 5] [1 2 3 4 5]
# Iterate through elements of the array
for i in array_1:
# Print the elements from array_1. What will happen here?
print(array_1[i])
2 3 4 5
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-41-284bf47791bb> in <module> 2 for i in array_1: 3 # Print the elements from array_1. What will happen here? ----> 4 print(array_1[i]) IndexError: index 5 is out of bounds for axis 0 with size 5
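One safe fix is to generate indices from the array's length rather than from its values; numpy's ndenumerate() also yields valid (index, value) pairs directly. A sketch:

```python
import numpy as np

array_1 = np.array([1, 2, 3, 4, 5])

# Indices 0..4 come from the array's length, so no out-of-bounds error
for i in range(len(array_1)):
    print(array_1[i])

# np.ndenumerate() yields (index, value) pairs directly
for idx, value in np.ndenumerate(array_1):
    print(idx, value)        # (0,) 1, then (1,) 2, and so on
```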
Remember that arrays are data structures designed for broadcasting. That means we can do things like multiply across elements, or replace or fill multiple elements at once (in the case of DataFrames).
Challenge
Create a for loop that multiplies the first four digits of array_1 by 4. Store each iteration in an object called iteration.
# Two ways to access the first 4 digits of our array
array_1 = np.array([0,1,2,3,4,5])
# The long way around! Take the length of the array but then only the first four elements.
for i in range(len(array_1))[0:4]:
iteration = array_1[i] * 4
print(iteration)
0 4 8 12
# Knowing you only want the first 4 elements, why not just make a proper range?
for i in range(4):
iteration = array_1[i] * 4
print(iteration)
0 4 8 12
**The nditer() function**

So far we've only been generating a single iterator using the base behaviour of the for loop. However, we can use functions that return multiple iterables to us, which in turn can provide multiple iterators to a loop after assigning them to variables. This idea also occurs when converting dictionary.items() to an iterator (see Section 6: Appendix 1).
For the numpy package, we have a way to produce multiple iterables with the nditer() function. It can take in one or more array objects and return iterables for each - in the form of a tuple OR as separate iterables.
array_2 = np.array([6, 7, 8, 9, 10, 11])
array_1
array_2
for x, y in np.nditer([array_1, array_2]):
# This includes a little bit of string formatting magic that we'll discuss in an upcoming lecture
print("%d:%d" % (x,y), end = " ")
array([0, 1, 2, 3, 4, 5])
array([ 6, 7, 8, 9, 10, 11])
0:6 1:7 2:8 3:9 4:10 5:11
**Using nditer() to help broadcast between dissimilar array sizes**

Another feature of nditer() is how it handles the production of iterators for multidimensional arrays and the idea of broadcasting. Suppose instead of two length-6 arrays, one of our arrays was two-dimensional? With arrays and the right coding, we can broadcast across rows. Just be sure the sizes match properly or you'll receive an error.
# Make a 1D array of values 6-23, then reformat the shape to 3x6
array_3 = np.array(np.arange(6, 24)).reshape(3,6)
array_1
array_3
# We know that array_1 and array_3 are not the same size but they do share the same width dimension
for x, y in np.nditer([array_1, array_3]):
# We'll broadcast in a row-wise manner
print("%d:%d" % (x,y), end = " ")
array([0, 1, 2, 3, 4, 5])
array([[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
0:6 1:7 2:8 3:9 4:10 5:11 0:12 1:13 2:14 3:15 4:16 5:17 0:18 1:19 2:20 3:21 4:22 5:23
# Make just 3 values.
array_4 = np.array(np.arange(0,3))
array_4
array_3
# Will this code recycle our array_1 values?
for x, y in np.nditer([array_4, array_3]):
print("%d:%d" % (x,y), end = " ")
array([0, 1, 2])
array([[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23]])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-50-00a621ff5f20> in <module> 6 7 # Will this code recycle our array_1 values? ----> 8 for x, y in np.nditer([array_4, array_3]): 9 print("%d:%d" % (x,y), end = " ") ValueError: operands could not be broadcast together with shapes (3,) (3,6)
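As the error shows, numpy does not recycle values; the shapes must be broadcast-compatible. One possible fix (a sketch, not the only approach) is to reshape the short array into a 3x1 column so each of its values pairs with one row of the 3x6 array:

```python
import numpy as np

array_3 = np.arange(6, 24).reshape(3, 6)
array_4 = np.arange(0, 3)

# (3, 1) broadcasts against (3, 6): each value of array_4
# now pairs with one entire row of array_3
for x, y in np.nditer([array_4.reshape(3, 1), array_3]):
    print("%d:%d" % (x, y), end=" ")
```

This prints 0 paired with the first row (0:6 through 0:11), then 1 with the second row, then 2 with the third.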
**DataFrames**

So we've spent the last 3 lectures touching on or working explicitly with DataFrame objects. How does looping over these compare to lists, or even arrays? Recall that these are 2D structures of tabulated data, suggesting there is organization of some sort across rows and columns.
Let's import subset_taxa_metadata_merged.csv as data. As we start looping over the file, we'll also quickly recap on importing and subsetting data frames.
import pandas as pd
# Read in subset_taxa_metdata_merged.csv
data = pd.read_csv('data/subset_taxa_metadata_merged.csv')
data.head()
data.info()
# What is the sum of the count column?
data["count"].sum()
| OTU | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_14158 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 | 0 |
| 1 | OTU_97_14062 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium | 700114707 | 764831721 | 2 | Male | WUGC | Oral | Saliva | SRS062500 | 0 |
| 2 | OTU_97_6312 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700015167 | 158883629 | 1 | Female | BCM,BI | Airways | Anterior Nares | NaN | 0 |
| 3 | OTU_97_11576 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | 700113117 | 764467579 | 2 | Female | WUGC | Skin | Left Antecubital Fossa | SRS063079 | 0 |
| 4 | OTU_97_29218 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700103612 | 765094712 | 2 | Male | WUGC | Oral | Palatine Tonsils | SRS042263 | 0 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12502 entries, 0 to 12501 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OTU 12502 non-null object 1 SUPERKINGDOM 12502 non-null object 2 PHYLUM 12338 non-null object 3 CLASS 12260 non-null object 4 ORDER 12175 non-null object 5 FAMILY 11213 non-null object 6 GENUS 10123 non-null object 7 PSN 12502 non-null int64 8 RSID 12502 non-null int64 9 VISITNO 12502 non-null int64 10 SEX 12502 non-null object 11 RUN_CENTER 12502 non-null object 12 HMP_BODY_SITE 12502 non-null object 13 HMP_BODY_SUBSITE 12502 non-null object 14 SRS_SAMPLE_ID 11773 non-null object 15 count 12502 non-null int64 dtypes: int64(4), object(12) memory usage: 1.5+ MB
2470
for loops to import large datasets in smaller chunks¶Assuming that you only need to access parts of a file at a time to gather summary information, you can break down large files that will not fit in memory by importing them in smaller chunks. This saves memory and potentially time, as you don't have to wait for the whole file to load. Or, if information in the file is treated independently between lines or sections - like in large sequencing files - you can work with the data in smaller bites.
Luckily for us, the read_csv() function has a parameter chunksize that we can use to set how many lines we'd like in each chunk. By setting this parameter, read_csv() returns an iterable object called a TextFileReader instead of a DataFrame.
# another way to import a large file
# We need a place to put the resulting information
result = []
# Build a for loop by telling the read_csv to break into chunks.
for chunk in pd.read_csv('data/subset_taxa_metadata_merged.csv', chunksize=100):
# this operation will be applied to every chunk, and then all the individual results are added up
result.append(sum(chunk['count']))
total = sum(result) # sum chunk results in result
print(total)
2470
# same as
total = 0 # instead of an empty list, we initialize with 0 and add each chunk's sum to it
for chunk in pd.read_csv('data/subset_taxa_metadata_merged.csv', chunksize=100):
    total += chunk['count'].sum() # += adds each chunk's sum to the running total on every iteration
print(total)
2470
Here we will use the concat() function to grow a DataFrame row by row.
dataLoop = pd.DataFrame()
for chunk in pd.read_csv('data/subset_taxa_metadata_merged.csv', chunksize=100):
dataLoop = pd.concat([dataLoop, chunk], axis = 0)
dataLoop.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 12502 entries, 0 to 12501 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OTU 12502 non-null object 1 SUPERKINGDOM 12502 non-null object 2 PHYLUM 12338 non-null object 3 CLASS 12260 non-null object 4 ORDER 12175 non-null object 5 FAMILY 11213 non-null object 6 GENUS 10123 non-null object 7 PSN 12502 non-null int64 8 RSID 12502 non-null int64 9 VISITNO 12502 non-null int64 10 SEX 12502 non-null object 11 RUN_CENTER 12502 non-null object 12 HMP_BODY_SITE 12502 non-null object 13 HMP_BODY_SUBSITE 12502 non-null object 14 SRS_SAMPLE_ID 11773 non-null object 15 count 12502 non-null int64 dtypes: int64(4), object(12) memory usage: 1.5+ MB
pop() and insert()¶We'll quickly take our current DataFrame and move the count column over to the second position for easier visibility. We can use the pop() method to remove and retrieve the column, and the insert() method to place it back to where we want it. This will alter the DataFrame object in-place.
The pop() method takes the form of pop(item) where item is the name of the column to be removed.
The insert() method takes the form of insert(loc, column, value) where:
loc: the zero-indexed position you want to insert at.
column: the name of the column after insertion.
value: a scalar, Series or array that will be inserted.
Let's move that column now, shall we?
# Insert the popped `count` column in a single line of code
data.insert(loc = 1, column = "count", value = data.pop('count'))
# Check that it worked
data.head()
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_14158 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 |
| 1 | OTU_97_14062 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium | 700114707 | 764831721 | 2 | Male | WUGC | Oral | Saliva | SRS062500 |
| 2 | OTU_97_6312 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700015167 | 158883629 | 1 | Female | BCM,BI | Airways | Anterior Nares | NaN |
| 3 | OTU_97_11576 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | 700113117 | 764467579 | 2 | Female | WUGC | Skin | Left Antecubital Fossa | SRS063079 |
| 4 | OTU_97_29218 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700103612 | 765094712 | 2 | Male | WUGC | Oral | Palatine Tonsils | SRS042263 |
Recall there are a number of ways to subset a DataFrame object. We'll focus mainly on the multi-indexing methods which include:
loc[]: use index and column names in [row, column] notation.
iloc[]: use index numbers in [row, column] notation.
.colName: access the column names as attributes.
Passing a list of labels in [ ] to loc[] or iloc[] returns a DataFrame; a single label returns a Series object.
: can be used with loc[] and iloc[] to select everything along an axis.
loc[] and iloc[] can accept a boolean Series to subset rows from the DataFrame as long as the dimensions match.
& (logical AND) and | (logical OR) combine boolean Series element-wise.
For a brief recap of examples, see Section 7.0.0: Appendix 2.
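As a quick refresher, here is a minimal sketch of those bracket rules on a tiny made-up DataFrame (not our real dataset):

```python
import pandas as pd

# A tiny, hypothetical DataFrame just for illustration
toy = pd.DataFrame({"GENUS": ["Actinomyces", "Prevotella", "Streptococcus"],
                    "count": [0, 3, 17]})

# A single column label inside loc[] returns a Series;
# a list of labels returns a DataFrame
s = toy.loc[:, "count"]
df = toy.loc[:, ["count"]]
print(type(s).__name__, type(df).__name__)  # Series DataFrame

# iloc[] uses positions; : selects everything on that axis
first_row = toy.iloc[0, :]

# Boolean Series combined element-wise with & subset the rows
mask = (toy["count"] > 0) & (toy["GENUS"] == "Prevotella")
print(toy.loc[mask, "GENUS"].tolist())  # ['Prevotella']
```
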
Let's take our current dataset, data, and select only data where the GENUS is Streptococcus with a count > 0.
streptococcus = data.loc[(data["GENUS"] == "Streptococcus") & (data["count"] > 0)]
# Take a peek
streptococcus.head()
# How big is our filtered dataset?
streptococcus.info()
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | OTU_97_39456 | 17 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700098366 | 158458797 | 2 | Female | BCM | Oral | Hard Palate | SRS022494 |
| 536 | OTU_97_34962 | 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700033734 | 159713063 | 1 | Female | BCM,WUGC | Oral | Buccal Mucosa | SRS011661 |
| 686 | OTU_97_12398 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700099597 | 158438567 | 2 | Male | BCM | Oral | Hard Palate | SRS023436 |
| 976 | OTU_97_23768 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700113110 | 764467579 | 2 | Female | WUGC | Oral | Supragingival Plaque | SRS062514 |
| 1237 | OTU_97_38024 | 6 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 |
<class 'pandas.core.frame.DataFrame'> Int64Index: 34 entries, 192 to 12148 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OTU 34 non-null object 1 count 34 non-null int64 2 SUPERKINGDOM 34 non-null object 3 PHYLUM 34 non-null object 4 CLASS 34 non-null object 5 ORDER 34 non-null object 6 FAMILY 34 non-null object 7 GENUS 34 non-null object 8 PSN 34 non-null int64 9 RSID 34 non-null int64 10 VISITNO 34 non-null int64 11 SEX 34 non-null object 12 RUN_CENTER 34 non-null object 13 HMP_BODY_SITE 34 non-null object 14 HMP_BODY_SUBSITE 34 non-null object 15 SRS_SAMPLE_ID 29 non-null object dtypes: int64(4), object(12) memory usage: 4.5+ KB
What happens if we use the and keyword on this data frame?
streptococcus['count'] <= 5 and streptococcus['count'] > 16
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-57-46ddac9c9905> in <module> ----> 1 streptococcus['count'] <= 5 and streptococcus['count'] > 16 ~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self) 1533 @final 1534 def __nonzero__(self): -> 1535 raise ValueError( 1536 f"The truth value of a {type(self).__name__} is ambiguous. " 1537 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# What about a bitwise operator?
streptococcus['count'] <= 5 & streptococcus['count'] > 16
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-58-da1eb8e3d22f> in <module> 1 # What about a bitwise operator? ----> 2 streptococcus['count'] <= 5 & streptococcus['count'] > 16 ~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self) 1533 @final 1534 def __nonzero__(self): -> 1535 raise ValueError( 1536 f"The truth value of a {type(self).__name__} is ambiguous. " 1537 "Use a.empty, a.bool(), a.item(), a.any() or a.all()." ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
# Remember order of precedence, you'll need some () to evaluate properly
(streptococcus['count'] <= 5) & (streptococcus['count'] > 16)
192 False 536 False 686 False 976 False 1237 False 1679 False 1789 False 2156 False 2327 False 2545 False 2641 False 3196 False 3414 False 5628 False 5829 False 5925 False 6064 False 6333 False 6796 False 6999 False 7244 False 8016 False 8950 False 9109 False 9376 False 9641 False 10338 False 10595 False 10747 False 10955 False 11775 False 11973 False 12067 False 12148 False Name: count, dtype: bool
logical_*() functions as element-wise boolean operators¶In the above example we were simply trying to combine the boolean outputs of two pandas Series objects. However, Python could not determine the truth value of these objects. Instead, we need to turn to the functions np.logical_and(), np.logical_or() and np.logical_not() to accomplish our task. They have the same behaviour as their Python counterparts (and, or, not) but are able to properly handle the multi-dimensional data of these objects.
These functions are also distinguished from the bitwise operators & (AND), | (OR) and ~ (NOT) mainly by their order of precedence. Remember that the bitwise operators take precedence over the comparison operators (<, >, ==, etc.), which is why we needed the parentheses above.
The logical_*() functions, however, take boolean arrays over which they evaluate the expression element-wise and return a result. These functions are part of the NumPy package and were designed to work with ndarray objects; recall that a Series is built on top of an ndarray, so they work on Series objects too.
Let's see some follow-up examples.
# Revisit our first example
streptococcus[np.logical_and(streptococcus['count'] <= 5, # boolean Series
streptococcus['count'] > 16) # Combine with this boolean Series
]
# why are we getting an empty dataset?
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID |
|---|
We know there are Streptococcus OTUs with fewer than 5 counts and others with more than 300. Why are they not showing up in the output?
# Because count cannot be less than or equal to 5 AND greater than 16 at the same time.
# It's either one OR the other. What would be a more appropriate function to use in this case?
streptococcus[np.logical_or(streptococcus['count'] <= 5,
streptococcus['count'] > 16)
][0:6] # Grab the first 6 rows of the result
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 192 | OTU_97_39456 | 17 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700098366 | 158458797 | 2 | Female | BCM | Oral | Hard Palate | SRS022494 |
| 536 | OTU_97_34962 | 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700033734 | 159713063 | 1 | Female | BCM,WUGC | Oral | Buccal Mucosa | SRS011661 |
| 686 | OTU_97_12398 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700099597 | 158438567 | 2 | Male | BCM | Oral | Hard Palate | SRS023436 |
| 976 | OTU_97_23768 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700113110 | 764467579 | 2 | Female | WUGC | Oral | Supragingival Plaque | SRS062514 |
| 1679 | OTU_97_26268 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700114603 | 763901136 | 2 | Female | WUGC | Oral | Palatine Tonsils | SRS062803 |
| 1789 | OTU_97_28039 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700015422 | 158944319 | 1 | Female | JCVI,BI | Oral | Saliva | NaN |
Select streptococcus with less than 10 counts that do not come from Saliva samples
# Two conditions to fill: count values < 10 and subsite is not "Saliva"
streptococcus[np.logical_and(streptococcus['count'] < 10,
streptococcus['HMP_BODY_SUBSITE'] != "Saliva")
]
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536 | OTU_97_34962 | 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700033734 | 159713063 | 1 | Female | BCM,WUGC | Oral | Buccal Mucosa | SRS011661 |
| 686 | OTU_97_12398 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700099597 | 158438567 | 2 | Male | BCM | Oral | Hard Palate | SRS023436 |
| 976 | OTU_97_23768 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700113110 | 764467579 | 2 | Female | WUGC | Oral | Supragingival Plaque | SRS062514 |
| 1237 | OTU_97_38024 | 6 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 |
| 1679 | OTU_97_26268 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700114603 | 763901136 | 2 | Female | WUGC | Oral | Palatine Tonsils | SRS062803 |
| 2156 | OTU_97_22614 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700106862 | 764811490 | 2 | Female | WUGC | Oral | Buccal Mucosa | SRS054624 |
| 2327 | OTU_97_24249 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700114057 | 764062976 | 2 | Female | WUGC | Oral | Supragingival Plaque | SRS065054 |
| 2545 | OTU_97_29375 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700021312 | 763536994 | 1 | Male | WUGC | Oral | Tongue Dorsum | SRS014271 |
| 2641 | OTU_97_12527 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700097643 | 158155345 | 2 | Male | BCM | Oral | Supragingival Plaque | SRS021934 |
| 3196 | OTU_97_38855 | 8 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700105162 | 612472597 | 1 | Female | WUGC | Oral | Supragingival Plaque | SRS053278 |
| 3414 | OTU_97_37239 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700014609 | 158438567 | 1 | Male | BCM,BI | Oral | Tongue Dorsum | NaN |
| 5628 | OTU_97_21855 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700111582 | 937495960 | 1 | Female | JCVI | Oral | Buccal Mucosa | SRS065316 |
| 5925 | OTU_97_25035 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700102250 | 159166850 | 2 | Male | BCM | Oral | Buccal Mucosa | SRS056480 |
| 6064 | OTU_97_29266 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700095963 | 160906640 | 1 | Male | JCVI | Oral | Hard Palate | SRS020598 |
| 6333 | OTU_97_9610 | 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700106498 | 553359145 | 1 | Female | WUGC | Oral | Buccal Mucosa | SRS050655 |
| 6999 | OTU_97_34341 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700114705 | 764831721 | 2 | Male | WUGC | Skin | Right Antecubital Fossa | SRS065474 |
| 7244 | OTU_97_39165 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700110298 | 892969023 | 2 | Female | JCVI | Oral | Throat | SRS043603 |
| 8016 | OTU_97_19766 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700095780 | 160825720 | 1 | Male | JCVI | Oral | Tongue Dorsum | SRS020484 |
| 8950 | OTU_97_38584 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700024302 | 764487809 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS015675 |
| 10338 | OTU_97_22344 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700021315 | 763536994 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS014277 |
| 10595 | OTU_97_38886 | 3 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700037961 | 765317243 | 1 | Female | WUGC | Oral | Buccal Mucosa | SRS019522 |
| 10747 | OTU_97_4255 | 2 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700014731 | 158458797 | 1 | Female | BCM,BI | Oral | Tongue Dorsum | NaN |
| 11973 | OTU_97_38892 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700024085 | 764083206 | 1 | Male | WUGC | Airways | Anterior Nares | SRS015450 |
| 12067 | OTU_97_3802 | 1 | Bacteria | Firmicutes | Bacilli | Lactobacillales | Streptococcaceae | Streptococcus | 700033012 | 159247771 | 1 | Female | JCVI,WUGC | Oral | Throat | SRS011567 |
That is all for the recap on conditional and logical operators. Back to flow control.
for loop iterator for DataFrames returns column names¶How do we print the first 10 observations from the GENUS column? We already have a number of routes to arrive at this solution but can we accomplish this using a for loop? Let's try the intuitive thing and just provide the DataFrame to the for loop first.
# Make a for loop by passing the first 10 rows of data
for genus in data[0:10]:
print(genus)
OTU count SUPERKINGDOM PHYLUM CLASS ORDER FAMILY GENUS PSN RSID VISITNO SEX RUN_CENTER HMP_BODY_SITE HMP_BODY_SUBSITE SRS_SAMPLE_ID
DataFrame column to iterate in a for loop¶No errors but this is not what we wanted.
Can you identify what is missing in the code? Our call managed to unpack all of the column names in data - not just the first 10. Iterating directly over a DataFrame yields its column names, so that is what Python printed; the [0:10] slice selected rows, but the loop still iterated over columns.
So we definitely didn't provide Python with the code needed to interpret our intent. Would we be better off just providing a single column? Let's try.
# Break out a single column from data and see if it can iterate on that?
for i in data['GENUS'][0:10]:
print('Genus: ' + str(i))
Genus: Actinomyces Genus: Propionibacterium Genus: Porphyromonas Genus: Prevotella Genus: Porphyromonas Genus: nan Genus: Johnsonella Genus: Corynebacterium Genus: Fusobacterium Genus: nan
notna() method to further filter your for loop iterator values¶As you can see, providing a single column generates an iterator over the elements of the Series. That works, but it doesn't get us the first 10 valid entries from the GENUS column. Instead, we should filter out missing (NaN) values with the notna() method.
To avoid NaN in the output, pass notna() to the data subsetting. At this point our code gets a little long when we perform the subsetting within the for loop. To reduce confusion, you could create a variable before using it in the for loop.
In addition to the subsetting, sort the data alphabetically in increasing order (from A to Z) after selecting the first 10 values. Let's do this in a couple of steps.
# Make your for loop by filtering before retrieving an iterator
for i in data.loc[data['GENUS'].notna()]['GENUS'][0:10]:
print('Genus: ' + str(i))
Genus: Actinomyces Genus: Propionibacterium Genus: Porphyromonas Genus: Prevotella Genus: Porphyromonas Genus: Johnsonella Genus: Corynebacterium Genus: Fusobacterium Genus: Streptococcus Genus: Propionibacterium
The for loop above prints the first 10 genera that are not NaN. Now let's make the loop look better and easier to debug.
# easier to read when subsetting outside the loop
data_fil = data.loc[data['GENUS'].notna()]
data_fil.head()
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_14158 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 |
| 1 | OTU_97_14062 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium | 700114707 | 764831721 | 2 | Male | WUGC | Oral | Saliva | SRS062500 |
| 2 | OTU_97_6312 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700015167 | 158883629 | 1 | Female | BCM,BI | Airways | Anterior Nares | NaN |
| 3 | OTU_97_11576 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | 700113117 | 764467579 | 2 | Female | WUGC | Skin | Left Antecubital Fossa | SRS063079 |
| 4 | OTU_97_29218 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700103612 | 765094712 | 2 | Male | WUGC | Oral | Palatine Tonsils | SRS042263 |
# Don't forget to sort the values into alpha order!
for genus in data_fil['GENUS'][0:10].sort_values():
print('Genus: ' + genus)
Genus: Actinomyces Genus: Corynebacterium Genus: Fusobacterium Genus: Johnsonella Genus: Porphyromonas Genus: Porphyromonas Genus: Prevotella Genus: Propionibacterium Genus: Propionibacterium Genus: Streptococcus
DataFrame¶We've seen now that looping through a DataFrame by column can be straightforward. Often, however, we encounter data sets where we want to collate data from multiple columns. In that case, we would want to iterate by rows through the DataFrame. Be forewarned: this can be both memory intensive and slow once your DataFrame is sufficiently large.
There are two methods we can use to properly create row iterators from a DataFrame:
The iterrows() method returns each row as a Series object, but this can be problematic because all of the row's values must be converted to a single data type. This can produce unpredictable or undesired results.
The itertuples() method returns each row as a namedtuple, preserves the column types, and is much faster, making it preferable to iterrows().
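To see the type-conversion pitfall concretely, here is a minimal sketch on a hypothetical two-column frame (not our dataset): an int64 column sitting next to a float64 column gets upcast to float64 by iterrows(), while itertuples() keeps the original values.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: an integer column next to a float column
mixed = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 0.25]})

# iterrows() squeezes each row into one Series with a single dtype,
# so the int64 'count' values come back upcast to float64
row_dtypes = [row.dtype for _, row in mixed.iterrows()]
print(row_dtypes)  # [dtype('float64'), dtype('float64')]

# itertuples() returns namedtuples that keep each column's own value
first = next(mixed.itertuples())
print(first.count, first.ratio)  # 1 0.5
```
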
We'll focus on using itertuples() to see exactly how we can use it to iterate over our rows. With each pass we'll add to running sums of two columns in the DataFrame: "count" and "VISITNO".
# Create a variable to hold our results
summation = [0,0]
# Loop through
for row in data_fil.head(10).itertuples():
# Print some column values for the row
print(row.Index, row.count, row.GENUS)
# Produce some new values based on the values in the current row
summation[0] = summation[0] + row.count
summation[1] = summation[1] + row.VISITNO
summation
0 0 Actinomyces 1 0 Propionibacterium 2 0 Porphyromonas 3 0 Prevotella 4 0 Porphyromonas 6 0 Johnsonella 7 0 Corynebacterium 8 0 Fusobacterium 10 0 Streptococcus 11 0 Propionibacterium
[0, 15]
range() to iterate through a DataFrame¶Sometimes the simplest way to work through your DataFrame object, row by row, is by using index positions in combination with range(). If you don't know what range to use, you can retrieve the dimensions of the DataFrame from its shape attribute.
While not quite as clean as using an iterator, it's certainly an approach that will work.
# Create a variable to hold our results
summation = [0,0]
# Loop through
for row_num in range(0, 5):
# Print some column values for the row
print(data_fil.iloc[row_num].filter(['count', 'GENUS']))
# Produce some new values based on the values in the current row
summation[0] = summation[0] + data_fil.iloc[row_num]['count']
summation[1] = summation[1] + data_fil.iloc[row_num]['VISITNO']
summation
count 0 GENUS Actinomyces Name: 0, dtype: object count 0 GENUS Propionibacterium Name: 1, dtype: object count 0 GENUS Porphyromonas Name: 2, dtype: object count 0 GENUS Prevotella Name: 3, dtype: object count 0 GENUS Porphyromonas Name: 4, dtype: object
[0, 8]
for loop, but should you?¶We've been having fun generating code that lets us iterate through our objects, but remember that there are built-in functions for calculating the simpler summaries of our data structures. Sometimes we need the practice; otherwise it's just a matter of efficiency - especially with large data sets.
Let's use a for loop to calculate some summary statistics on our filtered subset - just for practice.
# We can calculate summary values ourselves
total = 0
for t in data_fil['count']:
total = total + t
data_fil_mean = total / len(data_fil['count']) # Should this belong in the for loop?
print(data_fil_mean)
# Or just do this
print()
print("Calculate the mean with a method that already exists")
data_fil['count'].mean()
0.2393559221574632 Calculate the mean with a method that already exists
0.2393559221574632
# Calculate the mean of the VISITNO column in our data_fil dataset
visitno_mean = 0
for value in ...
visitno_mean = ...
print(...)
for loops do not always have to be at the beginning of your code section. Instead, we can build a for loop directly into a calculation when we want a quick iterable to evaluate. This can take the form of
newlist = [expression for item in iterable if condition] where:
newlist is the variable where we want to save our new list.
expression is the equivalent of the statement in our for loop - something we want to apply to each item.
for item in iterable triggers the for loop to request an iterator that is assigned to item.
condition is an optional filter with which we can subset the items. If there is no condition, the entire iterable is used.
In the following example, we will calculate the standard deviation of bacterial counts by taking the square root of the variance.
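Before we apply the pattern to our counts, here is a toy example of the basic form using made-up numbers:

```python
# Square only the positive values from a made-up list:
# expression = v ** 2, iterable = values, condition = v > 0
values = [-2, -1, 0, 1, 2, 3]
squares = [v ** 2 for v in values if v > 0]
print(squares)  # [1, 4, 9]
```
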
$$\sigma = \sqrt{\frac{\sum(x_i - \bar{x})^2}{N}}$$We'll build up slowly to get a sense of what's happening.
# This syntax returns a generator to us - remember that's a type of iterator
type((value) for value in data_fil['count'])
generator
Now let's do something with our generator by using the pow(value, exponent) function to get the square of the difference between the value and the mean.
# Use the generator to generate the sum of the difference between values and the mean
# recall we calculated the mean of 'data_fil' earlier
sum(pow(value - data_fil_mean, 2) for value in data_fil['count'])
623601.0406006207
We now have the top half of our equation - which really takes care of the list comprehension part for us. We didn't need to filter the data but we could have included a condition like sum(pow(value - mean, 2) for value in data_fil['count'] if value > 0) which would alter our sum total (try it for yourself!)
Now we just need to divide by N and calculate the square root.
# Use the generator to generate some numbers
sum(pow(value - data_fil_mean, 2) for value in data_fil['count'])/len(data_fil['count'])
61.60239460640331
# calculate the variance
variance = sum(pow(value - data_fil_mean, 2) for value in data_fil['count'])/len(data_fil['count'])
# Calculate the standard deviation
stdev = np.sqrt(variance)
print(stdev)
7.848719297210425
Use NumPy's np.std() function to corroborate your result
np.std(data_fil['count'])
7.848719297210425
DataFrame of counts per microbe¶Let's take a closer look at our filtered dataset data_fil and generate total counts for each unique GENUS value. We'll use a combination of filtering and method chaining. At the end we'll combine our values with the zip() function, which can pair up iterables as columns to help us make a DataFrame.
The zip() function takes multiple iterables and matches their elements together by position to create a series of tuples, returned as an iterator.
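To see zip() in isolation, here is a minimal sketch using a few genus/count values lifted from the output table below:

```python
genera = ["Streptococcus", "Lactobacillus", "Fusobacterium"]
counts = [690, 558, 194]

# zip() pairs the lists element by element and stops at the shortest input
pairs = list(zip(genera, counts))
print(pairs)
# [('Streptococcus', 690), ('Lactobacillus', 558), ('Fusobacterium', 194)]
```
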
Do you recall another way to do this from last lecture?
# Capture the unique genera
genera = []
# Capture the counts
sum_count = []
for genus in data_fil['GENUS'].unique():
# Build our list of genera
genera.append(genus)
# Create a sum of each genera
# Filter for genus, then pull down the count column, then sum
sum_count.append(data_fil[data_fil['GENUS'] == genus]['count'].sum())
# Combine our two lists together by matching indices
genus_count = pd.DataFrame(zip(genera, sum_count)).sort_values(by=1, ascending=False)
genus_count.head()
| 0 | 1 | |
|---|---|---|
| 7 | Streptococcus | 690 |
| 15 | Lactobacillus | 558 |
| 6 | Fusobacterium | 194 |
| 12 | Staphylococcus | 180 |
| 1 | Propionibacterium | 124 |
DataFrame¶So in the above, we generated a couple of extra variables, stored the results, and used the zip() function to combine them before making a DataFrame. Now we'll use a list comprehension to do the same thing in a "single" line. Here we've broken it into a few lines for readability. Before we move ahead, let's break down the components:
A for loop in which we produce an iterator of unique genus names.
A filter on each genus to sum the count per genus.
A tuple (genus, sum) in which we match a genus with the sum of its count variable.
# A more complex example of list comprehension that forgoes a separate for loop block
#
pd.DataFrame([(genus, sum(data_fil[data_fil['GENUS'] == genus]['count'])) for genus in data_fil['GENUS'].unique()]) \
.sort_values(by = 1, ascending = False) \
.head()
| 0 | 1 | |
|---|---|---|
| 7 | Streptococcus | 690 |
| 15 | Lactobacillus | 558 |
| 6 | Fusobacterium | 194 |
| 12 | Staphylococcus | 180 |
| 1 | Propionibacterium | 124 |
So in both of our previous examples we used the for loop in some way to iterate through a list to filter data before summarizing it. Of course, we've seen there is an easier way to do this with the proper use of the groupby() method. Let's recap how that works.
# We can replace the entire for loop and filter code with a simple groupby() call
data_fil.groupby(by="GENUS")[["count"]].sum().sort_values(by="count", ascending=False).head()
| count | |
|---|---|
| GENUS | |
| Streptococcus | 690 |
| Lactobacillus | 558 |
| Fusobacterium | 194 |
| Staphylococcus | 180 |
| Propionibacterium | 124 |
for loops are loops inside loops inside loops...¶We've already discussed the concept of nested objects: lists, arrays, dictionaries. A nested for loop is a similar idea: having loops running as statements within your loops. There's no real limit to how many for loops you can nest but if you're deeply nesting for loops, there may be better ways to accomplish your goal.
Now that we have the non-missing data in the form of data_fil, let's create a similar cumulative sum of counts, except on a per-genus, per-body-site basis. We'll change it up and build our results with a dictionary this time. It's really quite similar to what we had before, except all of the results are stored in a single variable.
# You can make an empty dictionary with keys assigned to empty lists as a value
results_dict = {"GENUS": [],
"HMP_BODY_SITE": [],
"count": []}
# First or outer for loop
for i in data_fil['GENUS'].unique():
# Second or inner for loop
for j in data_fil['HMP_BODY_SITE'].unique():
# Build a dictionary with our three keys
# Each round we update each key in the dictionary
results_dict["GENUS"].append(i)
results_dict["HMP_BODY_SITE"].append(j)
results_dict["count"].append(data_fil[(data_fil['GENUS'] == i) & # filter for a genus
(data_fil['HMP_BODY_SITE'] == j) # filter for a specific site
]["count"].sum())
# convert the dictionary to a DataFrame using the keys as columns.
genus_site_count = pd.DataFrame.from_dict(results_dict, orient="columns")
# Take a peek at the data
genus_site_count.head()
# Get all the information from our new DataFrame
genus_site_count.info()
| GENUS | HMP_BODY_SITE | count | |
|---|---|---|---|
| 0 | Actinomyces | Oral | 101 |
| 1 | Actinomyces | Airways | 0 |
| 2 | Actinomyces | Skin | 0 |
| 3 | Actinomyces | Gastrointestinal Tract | 0 |
| 4 | Actinomyces | Urogenital Tract | 0 |
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1105 entries, 0 to 1104 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GENUS 1105 non-null object 1 HMP_BODY_SITE 1105 non-null object 2 count 1105 non-null int64 dtypes: int64(1), object(2) memory usage: 26.0+ KB
# Gather the number of unique values in GENUS and HMP_BODY_SITE
data_fil["GENUS"].unique().size
data_fil["HMP_BODY_SITE"].unique().size
# How many rows should we produce looking at "GENUS" vs "HMP_BODY_SITE"
data_fil["GENUS"].unique().size * data_fil["HMP_BODY_SITE"].unique().size
221
5
1105
We can use the sleep() function from the time module to show in real time what the loop does. We'll just do it on a subset of our data though.
# Import the time module
import time
# You can make an empty dictionary with keys assigned to empty lists as a value
results_dict = {"GENUS": [],
"HMP_BODY_SITE": [],
"count": []}
for i in data_fil['GENUS'].unique()[:4]: # First or outer for loop
for j in data_fil['HMP_BODY_SITE'].unique()[:4]: # Second or inner for loop
# Build a dictionary with our three keys
# Each round we update each key in the dictionary
results_dict["GENUS"].append(i)
results_dict["HMP_BODY_SITE"].append(j)
results_dict["count"].append(data_fil[(data_fil['GENUS'] == i) & # filter for a genus
(data_fil['HMP_BODY_SITE'] == j) # filter for a specific site
]["count"].sum())
# Print the count as it updates
print(results_dict["count"])
# Pause the program for 0.5 seconds
time.sleep(0.5)
[101] [101, 0] [101, 0, 0] [101, 0, 0, 0] [101, 0, 0, 0, 0] [101, 0, 0, 0, 0, 40] [101, 0, 0, 0, 0, 40, 84] [101, 0, 0, 0, 0, 40, 84, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0, 0, 94] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0, 0, 94, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0, 0, 94, 0, 0] [101, 0, 0, 0, 0, 40, 84, 0, 36, 0, 0, 0, 94, 0, 0, 0]
for loop¶Based on the information we have, it appears that we produced exactly what we wanted: 221 unique genera and 5 unique body sites yielding 1105 total combinations. Not so fast though - do all of these combinations truly exist in our dataset? In fact there are only 533 combinations between these two sets. We can prove this by turning again to the groupby() method.
By using the nested for loop we produced combinations that don't exist within our dataset. The problem arises because sum() over an empty selection quietly returns 0 rather than raising an error, so we end up filling in a value for every genus and body site combination whether or not it actually appears in the data.
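To see why this happens, here's a minimal sketch on a tiny made-up frame (hypothetical values, not our real dataset):

```python
import pandas as pd

# A made-up stand-in for data_fil
toy = pd.DataFrame({"GENUS": ["A", "A", "B"],
                    "HMP_BODY_SITE": ["Oral", "Oral", "Skin"],
                    "count": [5, 3, 2]})

# Filtering for a combination that never occurs gives an empty selection,
# and .sum() over an empty selection quietly returns 0
missing = toy[(toy["GENUS"] == "B") & (toy["HMP_BODY_SITE"] == "Oral")]["count"]
print(missing.sum())

# groupby() only emits the combinations that actually occur (2 rows, not 2 x 2 = 4)
print(toy.groupby(["GENUS", "HMP_BODY_SITE"])["count"].sum())
```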
# Using the groupby will return only the combinations that TRULY exist in our dataset.
data_fil.groupby(by=["GENUS", "HMP_BODY_SITE"])[["count"]].sum()
| count | ||
|---|---|---|
| GENUS | HMP_BODY_SITE | |
| Abiotrophia | Airways | 0 |
| Oral | 0 | |
| Skin | 0 | |
| Urogenital Tract | 0 | |
| Acetobacter | Oral | 0 |
| ... | ... | ... |
| Weissella | Oral | 0 |
| Urogenital Tract | 0 | |
| Yonghaparkia | Urogenital Tract | 0 |
| p-75-a5 | Airways | 0 |
| Oral | 0 |
533 rows × 1 columns
| Although for loops are helpful, you may consider another direction if you're nesting too deeply. https://devrant.com/rants/2230569/ive-seen-people-do-more-than-4-as-well-though |
We use the term conditionals to denote logical expressions that specifically evaluate to True or False and are used to determine how a program will run. Which set of code will it run next? Will it terminate a loop? This is where we also get the idea of flow control or control of flow.
if statement executes when the conditional evaluates to True¶The purpose of the if control statement is pretty clear. if a condition is met (True), then execute a statement. The following is the general structure of if:
if condition:
statement
where condition can be a simple logical expression or a complex one involving many of the operators we've already covered. Let's give it a try.
a = 4
if a == 4:
print('yes, a is 4')
yes, a is 4
# Nothing will print since a = 4 from above
if a == 5:
print('yes, a is 4')
else statement executes when your if conditional evaluates to False¶In the above code we have not set any instruction for when the condition evaluates to False. In that case, the statement line is not evaluated and therefore nothing is printed.
Think of the else statement much like plan B. It allows us to provide a catch-all set of code to run in the case where our conditional has "failed". Let's update our general code structure:
if condition:
first_statement
else:
second_statement
Simple, right?
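Before the DataFrame version, here's a minimal standalone sketch of if/else on a single made-up count value (the 150 threshold simply mirrors the upcoming example):

```python
# A hypothetical microbial count for illustration
count = 42

if count > 150:
    label = 'Y'   # abundant
else:
    label = 'N'   # not abundant

print(label)
```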
The following code adds a boolean column called abundant to a subset of data called subdata (just for computational efficiency). Every observation (row) where the microbial count is greater than 150 will be classified as "yes" for abundant and "no" otherwise.
We'll also revisit or introduce two new concepts:
itertuples(), which returns named tuples of values from our DataFrame. It allows us to iterate over rows as named tuples.
the * operator, used to pack and unpack iterables.
Packing and unpacking with *¶Up until now we've used this operator for multiplication and other purposes, but when placed directly to the left of an iterable, it helps unpack the elements for passing on as arguments to a function or passing along the elements of an iterator. Conversely, we can use it as part of a variable assignment to pack (or repack) an unspecified number of elements into a list as a single variable.
Let's practice with unpacking and packing before moving forward shall we?
# Make an example list
example_list = [1, 2, 3, 4, 5, 6]
# Print the list
print(example_list)
# Print the elements of the list in a single call
print(*example_list)
[1, 2, 3, 4, 5, 6] 1 2 3 4 5 6
# Break the list into multiple variables
first, *rest = example_list
print("first: ", first, " vs. the rest: ", rest)
first: 1 vs. the rest: [2, 3, 4, 5, 6]
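The description above also mentions unpacking an iterable into the arguments of a function call; here's a quick sketch using a made-up helper function:

```python
# A hypothetical function that takes three separate arguments
def add3(a, b, c):
    return a + b + c

nums = [1, 2, 3]

# *nums unpacks the list so the call is equivalent to add3(1, 2, 3)
print(add3(*nums))
```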
# Iterate through a word by unpacking it
word = 'data'
it = iter(word)
print(*it)
d a t a
# Running print(*it) again won't print anything; it will need to be redefined before print(*it) produces output again
print(*it)
# Uncomment this code cell if you need to restart the notebook!
# import pandas as pd
# data = pd.read_csv('data/subset_taxa_metadata_merged.csv')
# data = data.sort_values('count', ascending = False)
# print(data.head())
# Make a copy of our DataFrame
# subdata = data.iloc[:100, :].copy()
else statement¶Okay let's put that else statement to use now that we can understand the following code.
# Make a copy of our DataFrame
subdata = data.iloc[:100, :].copy()
# Create a new column full of NaN values
subdata['abundant'] = np.nan
# Set the "abundant" value row by row
for index, *row in subdata.itertuples():
# if our counts are > 150 then consider it abundant
if subdata.loc[index, 'count'] > 150:
subdata.loc[index, 'abundant'] = 'Y'
# Otherwise it is not abundant
else:
subdata.loc[index, 'abundant'] = 'N'
# Take a look at the resulting DataFrame changes
subdata.head(10)
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | abundant | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_14158 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 | N |
| 1 | OTU_97_14062 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium | 700114707 | 764831721 | 2 | Male | WUGC | Oral | Saliva | SRS062500 | N |
| 2 | OTU_97_6312 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700015167 | 158883629 | 1 | Female | BCM,BI | Airways | Anterior Nares | NaN | N |
| 3 | OTU_97_11576 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | 700113117 | 764467579 | 2 | Female | WUGC | Skin | Left Antecubital Fossa | SRS063079 | N |
| 4 | OTU_97_29218 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700103612 | 765094712 | 2 | Male | WUGC | Oral | Palatine Tonsils | SRS042263 | N |
| 5 | OTU_97_20709 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | Ruminococcaceae | NaN | 700109458 | 161554003 | 2 | Male | JCVI | Oral | Attached Keratinized Gingiva | SRS057920 | N |
| 6 | OTU_97_36456 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | Lachnospiraceae | Johnsonella | 700023179 | 763638144 | 1 | Female | WUGC | Oral | Subgingival Plaque | SRS014554 | N |
| 7 | OTU_97_43181 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Corynebacteriaceae | Corynebacterium | 700024800 | 764366428 | 1 | Female | WUGC | Skin | Left Retroauricular Crease | SRS016174 | N |
| 8 | OTU_97_21748 | 0 | Bacteria | Fusobacteria | Fusobacteria | Fusobacteriales | Fusobacteriaceae | Fusobacterium | 700037686 | 765034022 | 1 | Female | WUGC | Airways | Anterior Nares | SRS019239 | N |
| 9 | OTU_97_10463 | 0 | Bacteria | Firmicutes | Clostridia | Clostridiales | NaN | NaN | 700109397 | 370425937 | 2 | Female | JCVI | Skin | Left Retroauricular Crease | SRS050217 | N |
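As an aside (not part of the lecture skeleton), the same kind of abundant column can be built without a row-by-row loop using NumPy's vectorized where(); here's a sketch on a tiny made-up frame standing in for subdata:

```python
import numpy as np
import pandas as pd

# A made-up stand-in for subdata
demo = pd.DataFrame({"count": [0, 200, 90]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once
demo["abundant"] = np.where(demo["count"] > 150, "Y", "N")
print(demo)
```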
elif statement if you have multiple conditions to check¶Getting the hang of if and else? Next in line is elif. Consider elif an intermediate between if and else. It's literally a portmanteau of else and if. If you want to check for multiple possible scenarios - usually (but not necessarily) with an order of precedence - you can use the elif statement to go through that checklist. Let's see how it works.
Based on the microbial count, add a column called treatment that will be either treatment_A, treatment_B, or No action depending on the microbe counts (for the sake of this exercise, let's assume that all microbes have pathogenic potential on humans).
Remember: if a conditional fails, the statement within it will not be executed!
# create an empty column. It cannot be empty so we are populating it with NaN
subdata['treatment'] = np.nan
# Start our for loop
for index, *row in subdata.itertuples():
# if counts are >= 300 give treatment_A
if subdata.loc[index, 'count'] >= 300:
subdata.loc[index, 'treatment'] = 'treatment_A'
# otherwise, if counts are >= 100 give treatment_B
# Is this ENTIRE conditional necessary?
elif (subdata.loc[index, 'count'] >= 100) & (subdata.loc[index, 'count'] < 300):
subdata.loc[index, 'treatment'] = 'treatment_B'
# By default remaining values must be less than 100
else:
subdata.loc[index, 'treatment'] = 'No action'
# Check our resulting DataFrame
subdata.head()
| OTU | count | SUPERKINGDOM | PHYLUM | CLASS | ORDER | FAMILY | GENUS | PSN | RSID | VISITNO | SEX | RUN_CENTER | HMP_BODY_SITE | HMP_BODY_SUBSITE | SRS_SAMPLE_ID | abundant | treatment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | OTU_97_14158 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Actinomycetaceae | Actinomyces | 700106936 | 147406386 | 1 | Male | WUGC | Oral | Attached Keratinized Gingiva | SRS048393 | N | No action |
| 1 | OTU_97_14062 | 0 | Bacteria | Actinobacteria | Actinobacteria | Actinomycetales | Propionibacteriaceae | Propionibacterium | 700114707 | 764831721 | 2 | Male | WUGC | Oral | Saliva | SRS062500 | N | No action |
| 2 | OTU_97_6312 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700015167 | 158883629 | 1 | Female | BCM,BI | Airways | Anterior Nares | NaN | N | No action |
| 3 | OTU_97_11576 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Prevotellaceae | Prevotella | 700113117 | 764467579 | 2 | Female | WUGC | Skin | Left Antecubital Fossa | SRS063079 | N | No action |
| 4 | OTU_97_29218 | 0 | Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales | Porphyromonadaceae | Porphyromonas | 700103612 | 765094712 | 2 | Male | WUGC | Oral | Palatine Tonsils | SRS042263 | N | No action |
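Again as an aside (not the in-class method), an if/elif/else chain applied over rows can also be expressed with NumPy's np.select(), which checks conditions in order just like elif; here's a sketch on made-up counts:

```python
import numpy as np
import pandas as pd

# A made-up stand-in for subdata
demo = pd.DataFrame({"count": [350, 120, 40]})

# Conditions are evaluated in order, like an if/elif chain
conditions = [demo["count"] >= 300,
              demo["count"] >= 100]
choices = ["treatment_A", "treatment_B"]

# Anything matching no condition falls through to the default, like else
demo["treatment"] = np.select(conditions, choices, default="No action")
print(demo)
```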
while loops when you want to iterate based on a condition¶while loops run "while" a condition continues to evaluate to True. At the start of each loop, the condition is re-evaluated before a decision is made. If a for loop and an if statement were to make a weird code-baby, the while loop would be it.
You can use the conditional to iterate in different ways, for example until a container has been emptied or until a counter passes a threshold.
Let's experiment with how that works shall we?
# Build a list
x = [1, 2, 3, 4]
# while there is something in x
while x: # Recall: an empty object evaluates to False!
print(x...)
# What is -1 doing here? Reverse indexing except pop() already uses the end to remove items.
# Generate a counter
z = 1
# Use the counter in while loop
while ...:
# print the counter value
print(z)
# increase z by 1 with each iteration so the loop -eventually- breaks
z += 1
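One way the blanks above might be filled in (a sketch; the in-class answer may differ):

```python
# First skeleton: keep popping until the list is empty
x = [1, 2, 3, 4]
while x:                  # an empty list evaluates to False
    print(x.pop(-1))      # pop(-1) removes and returns the last item (pop() already defaults to -1)

# Second skeleton: loop on a counter until it passes a limit
z = 1
while z <= 5:             # the condition eventually fails because z grows
    print(z)
    z += 1
```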
while conditional can evaluate to False¶In our first example, there is an eventual end to the list because we are permanently removing items. Therefore the conditional will evaluate to False when the list is empty (ie []).
Our second example, however, requires us to remember to increment our variable z. Since this is quite a simple loop it's not an issue as we always increment the value of z. In other cases with complex branching code with if and/or elif statements you must be careful to check that your conditional will eventually fail.
Let's try another example where we print only those rows from subdata where the GENUS is either 'Streptococcus' or 'Lactobacillus'.
# Here's the list of our criteria
strept_lactobac = ['Streptococcus', 'Lactobacillus']
i=0
# Make a variable to hold our results
strept_lactobac_list = []
# Loop while we still have criteria in the list
while i < ...:
strept_lactobac_list.append(subdata[subdata['GENUS'] == strept_lactobac[i]])
i += 1
# print our list
strept_lactobac_list
# How long is the list?
len(strept_lactobac_list)
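A possible completion of the skeleton above, shown here on a small made-up frame standing in for subdata:

```python
import pandas as pd

# Hypothetical stand-in for subdata
subdata_demo = pd.DataFrame({"GENUS": ["Streptococcus", "Prevotella", "Lactobacillus"],
                             "count": [690, 10, 558]})

strept_lactobac = ['Streptococcus', 'Lactobacillus']
i = 0
strept_lactobac_list = []

# Loop while the index is still within the criteria list
while i < len(strept_lactobac):
    strept_lactobac_list.append(subdata_demo[subdata_demo['GENUS'] == strept_lactobac[i]])
    i += 1

print(len(strept_lactobac_list))
```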
Now we have a list of two DataFrames: "strept" and "lactobac". We can use a for loop to unlist them, then use Pandas' concat() to join them into a single data frame.
# Unlist into data frames and then join into a single data frame
list_b = [] # create empty list
for i in range(len(strept_lactobac_list)):
# Pull an element from the list
list_a = pd.DataFrame(strept_lactobac_list[i])
# Add that element into list_b
list_b.append(...)
# Concatenate that list_b to df_test
df_test = ... # Does this belong inside or outside the loop?
# Output the data
df_test
# What is this code really doing?
# Note, we already have a list of DataFrames so...
pd.concat(strept_lactobac_list)
next() function¶Recall from the lecture 03 appendix that for each iterator, we can use the next() function to retrieve the next item in the queue; the iterator remembers its place in the queue. This continues until the last element is evaluated, after which the iterator is empty.
If you try to go past the last element, Python will provide a StopIteration error to let you know you've gone too far.
Let's practice with the next() function.
x = [1, 2, 3]
my_iterator = iter(x)
print(my_iterator)
print(...)
print(next(my_iterator))
print(next(my_iterator))
print(next(my_iterator))
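A sketch of what the completed cell might look like, including catching the StopIteration error by name rather than letting it crash the program:

```python
x = [1, 2, 3]
my_iterator = iter(x)

print(next(my_iterator))   # retrieves 1
print(next(my_iterator))   # retrieves 2
print(next(my_iterator))   # retrieves 3

# A fourth call would raise StopIteration; we can catch it instead
try:
    next(my_iterator)
except StopIteration:
    print("The iterator is exhausted")
```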
break and continue can interrupt loops¶Sometimes you may be looping through with a for or while loop when an unexpected condition occurs. Perhaps you want to error-proof your code or need to exit a loop based on internal conditions encountered while examining your data. Sometimes you may have a last-ditch conditional to prevent yourself from iterating too many times, or even endlessly.
When you need to explicitly exit a loop, you can use the break command. This will end the loop without further repetition.
Alternatively, you may have a long series of code that you don't want to even bother evaluating with more conditionals (to save on processing power for instance). You can end the current iteration of a loop and begin the next using the continue command.
Let's work through a few examples
# break the loop when we see the first 'r'
for letter in 'bioinformatics':
# Here's our conditional
if letter == 'r':
...
# Print the current letter
print ('Current Letter :', letter)
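One possible completion of the skeleton above (the blank is where break goes):

```python
# break ends the loop entirely at the first 'r'
for letter in 'bioinformatics':
    if letter == 'r':
        break
    print('Current Letter :', letter)
```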
# This example breaks when var is equal to 5
var = 10
while var > 0:
# Print the current value of var
print ('Current variable value :', str(var))
# Decrement var by 1
var = ...
# Here's our conditional!
if var == 5:
break
print("Done!")
# An example of how to use continue
var = 10
while var > 0:
# decrement your variable first
var += -1
# Was the number even or odd?
if (var) % 2 == 1:
...
# only print even numbers
print ('Current variable value: ', str(var))
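And a possible completion of the continue example (the blank is where continue goes): odd values skip the print, so only even numbers appear.

```python
var = 10
while var > 0:
    # decrement first so the loop always moves toward ending
    var += -1
    # odd values jump straight to the next iteration
    if var % 2 == 1:
        continue
    # only even numbers reach this line
    print('Current variable value: ', var)
```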
| That's right, you can nest control statements inside control statements of any kind really... |
That's our fourth class on Python! You've made it through, and we've learned about a number of logical expression operators and how to apply them in loops and when filtering data.
At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.7% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1700 hours the following day).
Soon after the end of this lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete chapters 3-5 (Logic, Control Flow and Filter, 1500 possible points; Loops, 1450 possible points; and Case Study, 1200 possible points) from the Intermediate Python course. This is a pass-fail assignment, and in order to pass you need to achieve at least 3112 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please take a screenshot of the summary. You'll see this under the "Course Outline" menubar seen at the top of the page for each course. It should look something like this:
| A sample screen shot for one of the DataCamp assignments. You'll want to combine yours into single images or PDFs if possible |
Submit the file(s) for the homework to the assignment section of Quercus. This allows us to keep track of your progress while also producing a standardized way for you to check on your assignment "grades" throughout the course.
You will have until 13:59 hours on Thursday, February 17th to submit your assignment (right before the next lecture).
Revision 1.0.0: materials prepared by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1021H S LEC0140, 06-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Recall that dictionaries consist of key:value pairs and that, unlike lists, they have no numeric index. Instead, values are accessed by providing a matching key. When we provide a dictionary object to a for loop, it returns an iterator over its hash/keys.
Let's revisit our amino acid dictionary from lecture 2.
dictionary_aminoacids = {"Alanine": {"Ala", "A", "GCA GCC GCG GCT"},
"Cysteine": {"Cys", "C", "TGC, TGT"},
"Aspartic acid": {"Asp", "D", "GAC GAT"},
"Glutamic acid": {"Glu", "E", "GAA GAG"},
"Phenylalanine": {"Phe", "F", "TTC TTT"},
"Glycine": {"Gly", "G", "GGA GGC GGG GGT"},
"Histidine": {"His", "H", "CAC CAT"},
"Isoleucine": {"Ile", "I", "ATA ATC ATT"},
"Lysine": {"Lys", "K", "AAA AAG"},
"Leucine": {"Leu", "L", "TTA TTG CTA CTC CTG CTT"},
"Methionine": {"Met", "M" "ATG"},
"Asparagine": {"Asn", "N", "AAC AAT"},
"Proline": {"Pro", "P", "CCA CCC CCG CCT"},
"Glutamine": {"Gln", "Q", "CAA CAG"},
"Arginine": {"Arg", "R", "AGA AGG CGA CGC CGG CGT"},
"Serine": {"Ser", "S", "AGC AGT TCA TCC TCG TCT"},
"Threonine": {"Thr", "T", "ACA ACC ACG ACU"},
"Valine": {"Val", "V", "GTA GTC GTG GTT"},
"Tryptophan": {"Trp", "W", "TGG"},
"Tyrosine": {"Tyr", "Y," "TAC TAT"}
}
# Supplying a dictionary will iterate through its hash
for aminoacid in dictionary_aminoacids:
print(aminoacid)
Now that we know we can get the key information in our for loop, we can use that much like we did with our list examples to iterate through the value information stored in the dictionary.
# Provide the hash back to your dictionary
for aminoacid in dictionary_aminoacids:
print(aminoacid,"::",dictionary_aminoacids[aminoacid])
Besides iterating over the dictionary itself, you can also iterate over its keys and values. Remember that we can use methods from the dictionary object to return this information for us. There are three methods for this purpose: keys(), values(), and items().
# equivalent to providing just the dictionary object
for aminoacid in dictionary_aminoacids.keys():
print(aminoacid,"::",dictionary_aminoacids[aminoacid])
# Retrieve the values as an iterator
for info in dictionary_aminoacids.values():
print(info)
Or get the whole dictionary
# Retrieve the key:value pairs as an iterator
for info in dictionary_aminoacids.items():
print(info)
Each key:value pair is printed as a tuple.
# What type of object is returned by dictionary.items()?
type(dictionary_aminoacids.items())
# What type of object is a single dictionary item?
type(list(dictionary_aminoacids.items())[0])
# The above code is to some extent equivalent to
# Retrieve the key:value pairs as an iterator
for info in dictionary_aminoacids.items():
# Subset the tuple into its two components
print(info[0], "::", info[1])
for loop to assign multiple variables from your iterator¶Knowing that the items() method returns tuple objects from our dictionary - specifically with two elements each - can we take advantage of that information? Rather than index the information from the tuple, let's try to assign multiple variables to the elements of each tuple within the for loop itself.
# Assign both a key and value from each value in our iterator
for (key, value) in dictionary_aminoacids.items():
print(key, "::", value)
# What if we assign multiple values?
for (key, value, extra) in dictionary_aminoacids.items():
print(key, "::", value)
As you can see, trying to assign beyond the number of values available will result in an error.
nditer()¶From our previous examples with arrays, iterating through a 1D array seems pretty straightforward. Iteration over 2D NumPy arrays, however, is slightly more complex than with their 1D counterparts, especially if you are re-arranging the array on the fly.
"An important thing to be aware of for this iteration is that the order is chosen to match the memory layout of the array instead of using a standard C or Fortran ordering. This is done for access efficiency, reflecting the idea that by default one simply wants to visit each element without concern for a particular ordering. We can see this by iterating over the transpose of our previous array, compared to taking a copy of that transpose in C order." https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.nditer.html
What does all that mean? In simpler terms the iterator for an array uses the same order as it is stored in memory regardless of the shape the array may be in. Let's see how that plays out in practice
# Generate a 2x3 array
array_5 = np.array([[6, 7, 8],
[9, 10, 11]])
for x in np.nditer(array_5):
print(x)
# What does the transpose look like?
array_5.T
# How does that affect the iterator?
for x in np.nditer(array_5.T):
print(x)
# Use a copy of the transpose
for x in np.nditer(array_5.T.copy()):
print("transpose: ", x)
order parameter to override how iterator elements are made in nditer()¶See how the process of copying the array has re-arranged its elements in memory as well?
You don't necessarily want to copy your objects every time you move through them after transposing or reshaping. Instead, look to the specific parameters of nditer(), which include the order parameter. It takes on the values of:
C: C order, traverse horizontally.
F: Fortran order, traverse vertically.
A: Fortran order if all the arrays are Fortran contiguous; C order otherwise.
K: As close to the order array elements appear in memory as possible (keep existing order, default behaviour).
# Bring up array_5
array_5
# Fortran language order
for x in np.nditer(array_5, order='F'):
print(x, end=' ')
# Bring up array_5
array_5
# C language order
for x in np.nditer(array_5, order='C'):
print(x, end=' ')
# Look to a larger array
array_3.T
# Keep array order as it was when the array was created
for x in np.nditer(array_3.T, order='K'):
print(x, end=' ')
DataFrames¶Below you'll find some example code for subsetting DataFrame objects. Recall some of our rules for subsetting DataFrame objects:
The DataFrame.loc[] and iloc[] methods can be used to subset both a row and a column range, both of which are also amenable to slicing notation.
The loc[] and iloc[] methods accept boolean Series, conditional expressions, or a mixture of both. All Series and expressions must have the same number of elements as there are rows in the DataFrame.
DataFrames can be subset through a series of chained indexing operations, ie [row][col][conditional_expression], but you will be returned a copy of the data and not access to the original.
import pandas as pd
# Read in subset_taxa_metadata_merged.csv
data = pd.read_csv('data/subset_taxa_metadata_merged.csv')
data.head()
data.info()
# select based on row and column names
data.loc[:6,['GENUS']]
# select based on row and column indices
data.iloc[ :6 , [7]]
# Note how many rows we get back compared to loc[]
# call GENUS as an attribute of the data frame
data.GENUS[0:6]
# Chain-indexing by retrieving a column and then selecting rows
data[['GENUS']][0:6]
# slice of rows 0 to 4. No similar option available to subset columns
data[0:5]
# Use a logical expression to retrieve rows with count > 16
data.loc[(data["count"] > 16)][['GENUS']]
# Chain-index a logical, then pull a column, then specify rows
data.loc[(data.VISITNO < 2)]["GENUS"][0:6]
The next piece of code is not going to work. Can you tell why?
data.loc[(data.count < 2)]['GENUS'][:6]
# Has the same syntax as the previous code (data.loc[(data.VISITNO < 2)]['GENUS'][:6])
# but count is also a DataFrame method, so data.count returns the method rather than the column.
# That is why you should avoid naming your columns after existing methods or functions.
# this time it works by slightly changing the syntax
data.loc[(data['count'] < 2)]['GENUS'][:6]
# Try to multi-index to select
data.loc[(data['count'] < 2), 'GENUS'][:6]
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.